ABSTRACT

Lipreading in combination with an acoustic indication of voice 
fundamental frequency (F0) has been shown to greatly enhance 
word recognition accuracy with sentence stimuli [1]. A possible
explanation for this effect is that F0 delivers information for 
consonantal voicing. In Experiment 1, we showed with a 
computational model how voicing information affects the 
uniqueness of lipread words in a large phonemically transcribed 
machine-readable lexicon. In Experiment 2, the same 
computational methods were used to simulate the results 
obtained by McGrath and Summerfield [2] for lipreading with 
and without acoustic F0. The model failed to account in full for 
the behaviorally observed enhancements. It is suggested that 
lexical biasing in word recognition can account for the 
difference between the model and the behavioral results. (This 
work was supported by NIH Grant DC-00695.) 

1. INTRODUCTION 

Even under optimal viewing conditions, not all phonetic 
information is visible to the lipreader. As a result, the 
information needed to perceive some phonemic distinctions is 
not available. For example, lipreaders may not perceive any 
distinctions among productions of the consonants /b/, /p/, and 
/m/. The loss of phonemic distinctions results in reduced 
uniqueness for words in the lexicon [3,4]. Thus, understanding
spoken language is difficult for many deaf individuals. In order 
to enhance lipreading by deaf individuals, investigators have 
sought signals that can be transduced by an impaired auditory 
system or by an alternate sensory system such as touch. 

One such signal is voice fundamental frequency (F0). F0 is 
generated at the glottis, which is invisible to the lipreader. 
Several experiments have been reported in which simple 
acoustic stimuli composed of pulses generated as a function of 
F0 were presented to enhance lipreading. In these experiments, 
adults with normal hearing improved as much as 40 percentage 
points over lipreading alone when they lipread with the F0 
supplement [1,2]. 

The observed enhancement is typically attributed to the fact that 
F0 characteristics contribute to perception at several different 
linguistic levels, including consonantal voicing distinctions [5], 
lexical stress (e.g., CONvert versus conVERT), sentential stress
[6], word boundaries [7], and syntactic information [6]. However,
estimates of the contribution made by the different 
characteristics associated with F0 to the overall enhancement 
effect have not been obtained. Waldstein and Boothroyd [8] have 
suggested that as much as one half of the observed enhancement 
may be due to the information conveyed about the presence of 
consonantal voicing. The current computational experiments 
examined the contribution of consonantal voicing to the 
uniqueness of words in the lexicon. 

1.1 Sources of Voicing in Lipreading 

Because laryngeal vibrations are invisible, consonantal voicing 
is frequently hypothesized to be completely absent from the 
information available to the lipreader. Although this assertion 
may be true for lipread consonant-vowel nonsense syllables, it is 
not true for lipread words or sentences [9]. 

One source of consonantal voicing information is the preceding 
vowel duration for postvocalic consonants [10]. Vowel
durations are longer for voiced final consonants than for 
voiceless consonants. Durational cues are potentially available to 
the lipreader and are likely responsible for the partial visibility 
of final consonant voicing reported by Hnath-Chisolm and 
Kishon-Rabin [11]. Another source of consonantal voicing 
information is the distribution of phoneme patterns in the words 
of the language. For example, /b/ is distinguished from /p/ or /m/ 
in the English word “bought,” because “pought” and “mought”
are not words. Thus, the voicing distinction is available by virtue
of the lexicon’s structure. Of course, the voicing distinction is
not disambiguated via the lexicon’s structure for all words (e.g.,
“bat”) [4].

2. EXPERIMENT 1 

The goal of Experiment 1 was to model effects on the structure 
of the lexicon brought about when visible speech is enhanced 
with consonantal voicing information. Computational lexical 
modeling techniques [4,12,13] were applied to obtain frequency- 
weighted estimates of word uniqueness for lipreading alone (LA) 
and lipreading with voicing information (L+V). 

2.1 Methods 

Lexical modeling was applied as follows: First, a phonemically 
transcribed machine-readable lexical database was selected to
serve as a representative sample of the words in the language.
Along with a phonemic transcription, each word in the database 
had an associated estimate of its frequency of occurrence in the 
language. Second, transcription rules were defined on the basis 
of measures of phonetic similarity. The transcription rules were 
in the form of single symbol substitutions for all phonemes in 
phonemic equivalence classes. A phonemic equivalence class 
comprised the set of phonemes rendered equivalent by the loss 
of phonetic distinctiveness. (For example, when /b/, /p/, and /m/ 
are phonetically similar, a transcription rule is defined to 
transcribe each occurrence of /b/, /p/, and /m/ into one symbol
representing the equivalence class.) Third, the lexical database 
was then transcribed by applying the transcription rules. Lexical 
equivalence classes were formed by collapsing across 
identically transcribed words. (For example, under the phoneme 
equivalence class definition given above, “pat” and “bat” would 
both fall into the same lexical equivalence class.) Finally, 
metrics were computed to compare the distribution of patterns in 
the newly transcribed lexicon with the distribution of patterns in 
the original lexicon. 
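
The transcription and collapsing steps described above can be illustrated with a minimal Python sketch. The toy lexicon, the toy equivalence classes, and all function names are hypothetical illustrations; they are not taken from PhLex or from the authors' software.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> phonemic transcription (tuple of phonemes).
TOY_LEXICON = {
    "pat": ("p", "æ", "t"),
    "bat": ("b", "æ", "t"),
    "mat": ("m", "æ", "t"),
    "fat": ("f", "æ", "t"),
}

# Hypothetical phonemic equivalence classes (each set is one class).
TOY_CLASSES = [{"b", "p", "m"}, {"f", "v"}, {"t"}, {"æ"}]


def build_rules(classes):
    """Map every phoneme to a single symbol that stands for its class."""
    rules = {}
    for index, members in enumerate(classes):
        for phoneme in members:
            rules[phoneme] = f"C{index}"
    return rules


def transcribe(phonemes, rules):
    """Apply the single-symbol substitution rules to one transcription."""
    return tuple(rules[p] for p in phonemes)


def lexical_equivalence_classes(lexicon, rules):
    """Collapse across identically transcribed words."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[transcribe(phonemes, rules)].append(word)
    return list(groups.values())


rules = build_rules(TOY_CLASSES)
lexical_classes = lexical_equivalence_classes(TOY_LEXICON, rules)
```

In this toy case “pat,” “bat,” and “mat” fall into one lexical equivalence class, whereas “fat” remains unique.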

Lexical Database. The method described above was applied to 
the PhLex database [14], which comprises the 20,000 most 
frequent words in [15] and the 12,118 words in [16]. All of 
PhLex’s entries have transcriptions that include stress and
syllabification markers, and estimates of frequency of usage. 
When word frequency information was not available for an 
entry, frequency was set to 1. All frequencies were
log-transformed (base 10).

Transcription Rules. Sets of transcription rules were developed 
using estimates of visual phonetic similarity obtained from 
separate behaviorally obtained consonant and vowel confusion 
matrices [17]. These estimates were submitted to separate
hierarchical cluster analyses using the average linkage between 
groups method. Because perceptual data were not available for 
h j t )/, theoretical estimates of similarity were employed. 
Vowels and consonants were assumed to be maximally 
dissimilar, except for the consonant /j/ which was included in the 
vowel confusion matrix. (See Table 1.) The transcription rules 
applied to 17 vowels and 23 consonants.

Table 1 lists the sets of phonemic equivalence classes that were 
used for the transcription rules for the LA condition. The table 
shows that the number of equivalence classes increased at the 
same rate for consonants and vowels, and that the increases 
followed the hierarchical clustering results for between 2 and 19 
clusters. The range between 10 and 19 clusters best 
approximates the phoneme equivalence classes estimated for 
lipreaders [4]. 

A second group of transcription rule sets was generated for the 
L+V condition. This was accomplished by modifying each 
equivalence class such that voiceless consonants never appeared 
in the same equivalence class with a voiced or nasal consonant. 
For example, the phonemic equivalence class {d,t,s,z} was
separated into two new equivalence classes, {t,s} and {d,z}.
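
The voicing split can be sketched in a few lines of Python. The class memberships below are our reading of Table 1, the set of voiceless consonants is a standard phonetic assumption rather than something the text lists, and the function names are hypothetical.

```python
# Assumed set of voiceless consonants (standard phonetic classification).
VOICELESS = {"p", "f", "k", "h", "t", "s", "θ", "ʃ", "tʃ"}


def parse(spec):
    """Parse space-separated classes whose members are comma-separated."""
    return [set(group.split(",")) for group in spec.split()]


# LA equivalence classes as read from Table 1 (10, 12, and 19 classes).
LA_10 = parse("u,ʊ,ɝ o,aʊ i,ɪ,e,ɛ,æ ɔɪ,ɔ,aɪ,ə,ɑ,ʌ,j b,p,m f,v "
              "l,n,k,ŋ,g,h,d,t,s,z w,r ð,θ ʃ,tʃ,ʒ,dʒ")
LA_12 = parse("u,ʊ,ɝ o,aʊ i,ɪ,e,ɛ,æ ɔɪ ɔ,aɪ,ə,ɑ,ʌ,j b,p,m f,v "
              "l,n,k,ŋ,g,h d,t,s,z w,r ð,θ ʃ,tʃ,ʒ,dʒ")
LA_19 = parse("u,ʊ,ɝ o,aʊ i,ɪ e,ɛ æ ɔɪ ɔ aɪ,ə,ɑ,ʌ,j b,p,m f,v l "
              "n,k ŋ,g h d t,s,z w,r ð,θ ʃ,tʃ,ʒ,dʒ")


def split_by_voicing(classes, voiceless=VOICELESS):
    """Separate each class into voiceless and voiced/nasal/vowel members."""
    result = []
    for members in classes:
        unvoiced = members & voiceless
        voiced = members - voiceless
        # Keep only the non-empty halves; vowel classes stay intact.
        result.extend(part for part in (unvoiced, voiced) if part)
    return result


LV_COUNTS = {len(la): len(split_by_voicing(la)) for la in (LA_10, LA_12, LA_19)}
```

Under these assumptions the 10, 12, and 19 LA classes yield 15, 18, and 25 L+V classes, which agrees with the class counts reported in Table 2.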



Number of Phonemic     Phonemic Equivalence Classes
Equivalence Classes

19                     {u,ʊ,ɝ} {o,aʊ} {i,ɪ} {e,ɛ} {æ} {ɔɪ}
                       {ɔ} {aɪ,ə,ɑ,ʌ,j} {b,p,m} {f,v} {l}
                       {n,k} {ŋ,g} {h} {d} {t,s,z} {w,r}
                       {ð,θ} {ʃ,tʃ,ʒ,dʒ}

12                     {u,ʊ,ɝ} {o,aʊ} {i,ɪ,e,ɛ,æ} {ɔɪ}
                       {ɔ,aɪ,ə,ɑ,ʌ,j} {b,p,m} {f,v}
                       {l,n,k,ŋ,g,h} {d,t,s,z} {w,r} {ð,θ}
                       {ʃ,tʃ,ʒ,dʒ}

10                     {u,ʊ,ɝ} {o,aʊ} {i,ɪ,e,ɛ,æ}
                       {ɔɪ,ɔ,aɪ,ə,ɑ,ʌ,j} {b,p,m} {f,v}
                       {l,n,k,ŋ,g,h,d,t,s,z} {w,r} {ð,θ}
                       {ʃ,tʃ,ʒ,dʒ}

2                      {u,ʊ,ɝ,o,aʊ,i,ɪ,e,ɛ,æ,ɔɪ,ɔ,aɪ,ə,ɑ,ʌ,j}
                       {b,p,m,f,v,l,n,k,ŋ,g,h,d,t,s,z,w,r,ð,θ,ʃ,
                       tʃ,ʒ,dʒ}



Table 1. Equivalence classes comprising transcription rules. 



Application of Transcription Rules. Transcription rule sets for 
both LA and L+V were applied to the PhLex database. Two 
words were considered equivalent only when their phonemic,
stress, and syllabification patterns were identical. For
example, the noun “convert” and the verb “convert” were not 
considered equivalent. Thus, these analyses assumed accurate 
perception of lexical stress and syllabification. 

Quantitative Analysis. Two commonly employed metrics were 
computed to quantitatively analyze the distributions of patterns 
in the transcribed lexicon [12,13]. Frequency-weighted percent 
words unique was computed as 

\[ \%WU = \frac{F_u}{F_l} \times 100, \tag{1} \]

where $F_u$ is the sum of the frequencies of occurrence for
unique words in the transcribed lexicon, and $F_l$ is the sum of
frequencies of occurrence of words in the original lexicon. The
frequency-weighted metric estimates the extent to which unique 
words are encountered in everyday language. 

Frequency-weighted expected class size is computed as

\[ ECS = \sum_{a=1}^{n_e} l_a \frac{F_a}{F_l}, \tag{2} \]

where $n_e$ is the total number of lexical equivalence classes, $l_a$ is
the number of words in equivalence class $a$, $F_a$ is the sum of
frequencies of occurrence of words in equivalence class $a$, and
$F_l$ is the sum of the frequencies of occurrence of words in the
lexicon. The frequency-weighted metric estimates the average 
size of the equivalence classes encountered in everyday 
language. 
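
As an illustration of Equations (1) and (2), the following Python sketch computes both metrics for a toy transcribed lexicon. The words, pattern labels, frequency weights, and the function name are hypothetical and serve only to make the arithmetic concrete.

```python
from collections import defaultdict


def uniqueness_metrics(transcribed, frequency):
    """Return (%WU, ECS) for a transcribed lexicon.

    transcribed: word -> transcribed pattern (any hashable value)
    frequency:   word -> frequency weight of that word
    """
    classes = defaultdict(list)
    for word, pattern in transcribed.items():
        classes[pattern].append(word)

    # F_l: summed frequency of all words in the lexicon.
    f_l = sum(frequency[word] for word in transcribed)
    # F_u: summed frequency of words that are alone in their class.
    f_u = sum(frequency[words[0]]
              for words in classes.values() if len(words) == 1)
    percent_unique = 100.0 * f_u / f_l

    # ECS: each class size l_a weighted by its frequency share F_a / F_l.
    ecs = sum(len(words) * sum(frequency[w] for w in words)
              for words in classes.values()) / f_l
    return percent_unique, ecs


# Hypothetical toy data: three confusable words and one unique word.
toy_transcribed = {"pat": "A", "bat": "A", "mat": "A", "fat": "B"}
toy_frequency = {"pat": 2.0, "bat": 1.0, "mat": 1.0, "fat": 4.0}
toy_wu, toy_ecs = uniqueness_metrics(toy_transcribed, toy_frequency)
```

For this toy lexicon, $F_l = 8$ and $F_u = 4$, so %WU is 50, and ECS is $(3 \times 4 + 1 \times 4)/8 = 2$.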





2.2 Results and Discussion 

Table 2 shows that consonantal voicing substantially increases 
the percent unique words for every set of transcription rules. The 
largest enhancement was 15 percentage points, when the number 
of phonemic equivalence classes was 10 for LA and 15 for L+V. 
We estimated that 10 equivalence classes are typical of relatively
inaccurate lipreaders with normal hearing. The table also shows that many
words are not unique under the L+V condition, although a 
substantial reduction in the frequency-weighted expected class 
size occurs with consonantal voicing (L+V). 



Number of Phonemic     Percent Unique     Expected Class
Equivalence Classes    Words              Size
LA      L+V            LA      L+V        LA       L+V

2       3              7       18         422.6    86.6
10      15             43      58         14.1     4.3
12      18             54      62         5.1      2.2
19      25             76      85         1.6      1.2



Table 2. Percent unique words and expected class size as a 
function of LA versus L+V and number of phonemic 
equivalence classes. 



3. EXPERIMENT 2 

Experiment 2 was conducted to compare modeled LA and L+V 
with empirically obtained results from McGrath and 
Summerfield’s [2] Experiment 1. In their experiment, the 
number of keywords correct in sentences was measured for LA 
and L+V. In analyzing their data, McGrath and Summerfield 
split their subjects into three groups based on lipreading ability 
(poor, average, and good). They found that the magnitude of 
enhancement increased as a function of lipreading ability. 
Column 4 of Table 3 gives the percent keywords correct for 
poor, average, and good lipreaders in their study (see Figure 1 in 
[2]) in LA and L+V conditions. 

3.1 Methods 

Word set. McGrath and Summerfield employed the Bamford- 
Kowal-Bench (BKB) standard sentence lists [18]. Only the BKB 
keywords were analyzed here, as was the case in [2]. Of the 
1,050 keywords, five were eliminated from the analysis, because 
they did not exist in any form in the PhLex database. Of the 
remaining 1,045 words, morphological changes were required on 
12 words (8 singularizations, 1 pluralization, 4 verb tense 
changes) in order to find appropriate entries in PhLex. Each 
keyword token was counted. Thus, if a word occurred in several 
sentences, it contributed proportionally to the results. 

Procedure. Three different sets of phonemic equivalence classes 
were selected to simulate the three levels of lipreaders in
McGrath and Summerfield [2]. Ten phonemic LA equivalence
classes (with the corresponding 15 L+V equivalence classes) were
used to model poor lipreaders’ performance; twelve LA 
phonemic equivalence classes (18 L+V) were used to estimate 
average lipreaders’ performance; and nineteen LA phonemic 
equivalence classes (25 L+V) were used to estimate good 
lipreaders’ performance. (See Table 1 for the 10, 12, and 19 LA 
equivalence classes.) These six rule sets were applied to the 
extracted BKB keywords and to the entire PhLex database. 

Quantitative Analysis. A BKB word was counted as
recognized if its transcribed form was unique in the 
corresponding transcription of PhLex. Percent correct was 
obtained by dividing the total number of transcribed unique 
words by the total number of words in the BKB keyword list. 

Average equivalence class size for BKB words was computed 
under the assumption that subjects selected their response words 
from their total lexicons. Thus, the average equivalence class 
size was computed by summing those equivalence class sizes for 
the classes that contained BKB words in the transcribed PhLex 
and dividing by the total number of BKB words. 
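
The scoring procedure can be sketched as follows. This is one reading of the description above, in which each keyword token contributes the size of its lexical equivalence class in the transcribed lexicon; the toy data and the function name are hypothetical.

```python
from collections import Counter


def score_keywords(keyword_tokens, lexicon_patterns):
    """Return (percent correct, average equivalence class size).

    keyword_tokens:   list of keyword tokens (repeats are counted)
    lexicon_patterns: word -> transcribed pattern in the full lexicon
    """
    # Size of every lexical equivalence class in the transcribed lexicon.
    class_size = Counter(lexicon_patterns.values())
    sizes = [class_size[lexicon_patterns[word]] for word in keyword_tokens]

    # A token counts as recognized only if its pattern is unique.
    unique_tokens = sum(1 for size in sizes if size == 1)
    percent_correct = 100.0 * unique_tokens / len(sizes)
    average_class_size = sum(sizes) / len(sizes)
    return percent_correct, average_class_size


# Hypothetical transcribed lexicon and keyword tokens.
toy_lexicon = {"pat": "A", "bat": "A", "mat": "A", "fat": "B", "dog": "C"}
toy_tokens = ["bat", "fat", "dog", "fat"]
toy_percent, toy_average = score_keywords(toy_tokens, toy_lexicon)
```

Here three of the four tokens are unique in the toy lexicon (75 percent correct), and the average class size is $(3 + 1 + 1 + 1)/4 = 1.5$.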

3.2 Results 



Number of Phonemic    Modeled Percent   Average Equivalence   Percent Correct,
Equivalence Classes   Correct           Class Size            McGrath & Summerfield
LA      L+V           LA      L+V       LA       L+V          Group   LA     L+V

10      15            12      18        34.9     10.2         Poor    9      11
12      18            18      26        12.1     4.5          Avg.    21     38
19      25            45      60        2.6      1.6          Good    42     69



Table 3. Modeled percent words correct, average equivalence 
class size, and percent words correct [2], as a function of LA 
versus L+V and number of phoneme equivalence classes. 



The results of Experiment 2 are shown in Table 3. The table can 
be used to obtain the modeled enhancement (L+V minus LA) for 
BKB words, which is approximately half that of the 
enhancement reported by [2] for average and good lipreaders. 
For example, the modeled enhancement for good lipreaders was 
15 percentage points (60 minus 45), but the behaviorally 
obtained enhancement was 27 percentage points (69 minus 42). 
On the other hand, McGrath and Summerfield’s poor lipreaders 
scarcely benefited from F0, whereas in the model the 
enhancement was 6 percentage points. The results on average 
equivalence class size show that consonantal voicing resulted in 
substantial reduction in class size. 
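
The comparison of enhancements can be made explicit with a short Python sketch that uses the percent-correct values from Table 3. The variable and function names are hypothetical; the behavioral L+V entries refer to lipreading with the acoustic F0 supplement in [2].

```python
# Percent correct (LA, L+V) from Table 3, by lipreader group.
TABLE_3 = {
    "poor":    {"model": (12, 18), "behavioral": (9, 11)},
    "average": {"model": (18, 26), "behavioral": (21, 38)},
    "good":    {"model": (45, 60), "behavioral": (42, 69)},
}


def enhancement(scores):
    """Enhancement in percentage points: L+V minus LA."""
    la, lv = scores
    return lv - la


ENHANCEMENTS = {
    group: {source: enhancement(scores) for source, scores in row.items()}
    for group, row in TABLE_3.items()
}

for group, row in ENHANCEMENTS.items():
    print(f"{group}: model +{row['model']}, behavioral +{row['behavioral']}")
```

The modeled enhancements are 6, 8, and 15 percentage points for poor, average, and good lipreaders, against behavioral enhancements of 2, 17, and 27 percentage points.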



4. GENERAL DISCUSSION 

Experiment 1 showed that consonantal voicing fails to 
disambiguate a majority of visually ambiguous words. At the
same time, the L+V transcriptions do substantially reduce the
number of words in equivalence classes. 

Experiment 2 showed that for the small set of keywords in the 
BKB sentence lists, L+V results in an increase in unique words 
over the LA transcriptions. However, the results of Experiment 2 
did not account fully for the McGrath and Summerfield data, 
which showed greater enhancements with the acoustic F0
supplement in behavioral tests. 

Several different factors could contribute to the results of
McGrath and Summerfield [2] and of related studies [1,8],
including effects due to syntactic and semantic levels of
processing. However, we are intrigued
with another possible contributor to word identification with F0,
which is suggested by the foregoing analyses of word 
equivalence class size. Contemporary models of word 
recognition incorporate frequency weighted decision rules [19]. 
Under conditions of ambiguity, the decision rule results in the 
selection of the most frequent word. Thus, accurate word 
recognition can occur as a function of both perceptual 
uniqueness and lexical biasing. Under the lipreading with F0
condition, lexical biasing could resolve the remaining ambiguity 
of words, particularly in sentence sets designed to sample 
frequent words, as in the BKB sentences. 
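
A frequency-weighted decision rule of the kind described above can be sketched in Python. This is a deliberately simplified illustration, not the model in [19]: the rule simply selects the most frequent member of a lexical equivalence class, and the words, frequencies, and function names are hypothetical.

```python
def biased_choice(equivalence_class, frequency):
    """Select the most frequent word among perceptually equivalent words."""
    return max(equivalence_class, key=lambda word: frequency[word])


def recognized(target, equivalence_class, frequency):
    """A target is recognized if lexical biasing selects it."""
    return biased_choice(equivalence_class, frequency) == target


# Hypothetical equivalence class and frequency weights.
toy_class = ["bat", "pat", "mat"]
toy_frequency = {"bat": 3.0, "pat": 2.0, "mat": 2.5}
toy_choice = biased_choice(toy_class, toy_frequency)
```

Under this rule a word that is not perceptually unique can still be recognized, provided that it is the most frequent member of its class; less frequent members of the same class are missed.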